Deep Music Genre Classification

Natural Language Processing

Neural Networks

Deep Learning

Deep Music Genre Classification in Python

Author

Lukka Wolff

Published

May 12, 2025

Abstract

In this blog post, we explore a deep learning approach to predicting the genres of music tracks. We leverage both song lyrics and engineered metadata features. We tokenize the lyrics with a BERT tokenizer and make use of Spotify’s engineered audio–semantic features (e.g., acousticness, danceability, thematic tags). We implement three neural networks: a lyric-based model, a metadata-only network, and a combined network that uses both lyric embeddings and engineered features. We compare how our different models stack up against one another and our base rate to assess the success of our different approaches to genre prediction.

Data

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

from torchinfo import summary

import pandas as pd
import numpy as np
import time

# for train-test split
from sklearn.model_selection import train_test_split

# for suppressing bugged warnings from torchinfo
import warnings
warnings.filterwarnings("ignore", category = UserWarning)

# tokenizers from HuggingFace
from transformers import BertTokenizer

# for building condensed vocab sets
# from torchtext.vocab import build_vocab_from_iterator

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

We load a Kaggle dataset with information about music released between 1950 and 2019, collected through Spotify. The dataset contains lyrics, artist info, track names, etc. Importantly, it also includes music metadata like sadness, danceability, loudness, acousticness, etc.

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/tcc_ceds_music.csv"
df = pd.read_csv(url)

Let’s have a look at some of the raw data!

df.head()

	Unnamed: 0	artist_name	track_name	release_date	genre	lyrics	len	dating	violence	world/life	...	sadness	feelings	danceability	loudness	acousticness	instrumentalness	valence	energy	topic	age
0	0	mukesh	mohabbat bhi jhoothi	1950	pop	hold time feel break feel untrue convince spea...	95	0.000598	0.063746	0.000598	...	0.380299	0.117175	0.357739	0.454119	0.997992	0.901822	0.339448	0.137110	sadness	1.0
1	4	frankie laine	i believe	1950	pop	believe drop rain fall grow believe darkest ni...	51	0.035537	0.096777	0.443435	...	0.001284	0.001284	0.331745	0.647540	0.954819	0.000002	0.325021	0.263240	world/life	1.0
2	6	johnnie ray	cry	1950	pop	sweetheart send letter goodbye secret feel bet...	24	0.002770	0.002770	0.002770	...	0.002770	0.225422	0.456298	0.585288	0.840361	0.000000	0.351814	0.139112	music	1.0
3	10	pérez prado	patricia	1950	pop	kiss lips want stroll charm mambo chacha merin...	54	0.048249	0.001548	0.001548	...	0.225889	0.001548	0.686992	0.744404	0.083935	0.199393	0.775350	0.743736	romantic	1.0
4	12	giorgos papadopoulos	apopse eida oneiro	1950	pop	till darling till matter know till dream live ...	48	0.001350	0.001350	0.417772	...	0.068800	0.001350	0.291671	0.646489	0.975904	0.000246	0.597073	0.394375	romantic	1.0

5 rows × 31 columns

Here is a brief look at how many songs we have in each represented genre.

df.groupby("genre").size()

genre
blues      4604
country    5445
hip hop     904
jazz       3845
pop        7042
reggae     2498
rock       4034
dtype: int64

This is a pretty large number of songs to classify… and some genres I personally don’t care for. So, to make the dataframe more manageable and applicable to me personally, we are going to narrow down to only reggae, hip hop, rock, and jazz.

genres = {
    "hip hop"   : 0,
    "jazz" : 1,
    "reggae" : 2,
    "rock" : 3,
}

# keep only the selected genres; .copy() so the relabeling below does not write into a slice
df = df[df["genre"].isin(list(genres))].copy()
df.head()

	Unnamed: 0	artist_name	track_name	release_date	genre	lyrics	len	dating	violence	world/life	...	sadness	feelings	danceability	loudness	acousticness	instrumentalness	valence	energy	topic	age
17091	54304	gene ammons	it's the talk of the town	1950	jazz	lovers sweethearts hard understand know happen...	61	0.001096	0.001096	0.001096	...	0.319570	0.001096	0.352323	0.620388	0.868474	0.235830	0.430132	0.282260	sadness	1.0
17092	54305	gene ammons	you go to my head	1950	jazz	head linger like haunt refrain spin round brai...	48	0.001754	0.340964	0.001754	...	0.001754	0.001754	0.379400	0.638541	0.907630	0.900810	0.221970	0.184159	violence	1.0
17093	54307	bud powell	yesterdays	1950	jazz	music speak start hear musicians like dizzy gi...	107	0.001144	0.001144	0.074762	...	0.001144	0.097082	0.489873	0.467400	0.992972	0.927126	0.334295	0.228204	music	1.0
17094	54311	tony bennett	stranger in paradise	1950	jazz	hand stranger paradise lose wonderland strange...	41	0.002105	0.180524	0.002105	...	0.527429	0.002105	0.179032	0.559470	0.983936	0.001781	0.086974	0.235211	sadness	1.0
17095	54313	dean martin	zing-a zing-a zing boom	1950	jazz	zinga zinga zinga zinga zinga zinga zinga zing...	160	0.001253	0.001253	0.001253	...	0.425721	0.001253	0.580851	0.687409	0.655622	0.000000	0.936109	0.418400	sadness	1.0

5 rows × 31 columns

df["genre"] = df["genre"].apply(genres.get)
df

	Unnamed: 0	artist_name	track_name	release_date	genre	lyrics	len	dating	violence	world/life	...	sadness	feelings	danceability	loudness	acousticness	instrumentalness	valence	energy	topic	age
17091	54304	gene ammons	it's the talk of the town	1950	1	lovers sweethearts hard understand know happen...	61	0.001096	0.001096	0.001096	...	0.319570	0.001096	0.352323	0.620388	0.868474	0.235830	0.430132	0.282260	sadness	1.000000
17092	54305	gene ammons	you go to my head	1950	1	head linger like haunt refrain spin round brai...	48	0.001754	0.340964	0.001754	...	0.001754	0.001754	0.379400	0.638541	0.907630	0.900810	0.221970	0.184159	violence	1.000000
17093	54307	bud powell	yesterdays	1950	1	music speak start hear musicians like dizzy gi...	107	0.001144	0.001144	0.074762	...	0.001144	0.097082	0.489873	0.467400	0.992972	0.927126	0.334295	0.228204	music	1.000000
17094	54311	tony bennett	stranger in paradise	1950	1	hand stranger paradise lose wonderland strange...	41	0.002105	0.180524	0.002105	...	0.527429	0.002105	0.179032	0.559470	0.983936	0.001781	0.086974	0.235211	sadness	1.000000
17095	54313	dean martin	zing-a zing-a zing boom	1950	1	zinga zinga zinga zinga zinga zinga zinga zing...	160	0.001253	0.001253	0.001253	...	0.425721	0.001253	0.580851	0.687409	0.655622	0.000000	0.936109	0.418400	sadness	1.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
28367	82447	mack 10	10 million ways	2019	0	cause fuck leave scar tick tock clock come kno...	78	0.001350	0.001350	0.001350	...	0.065664	0.001350	0.889527	0.759711	0.062549	0.000000	0.751649	0.695686	obscene	0.014286
28368	82448	m.o.p.	ante up (robbin hoodz theory)	2019	0	minks things chain ring braclets yap fame come...	67	0.001284	0.001284	0.035338	...	0.001284	0.001284	0.662082	0.789580	0.004607	0.000002	0.922712	0.797791	obscene	0.014286
28369	82449	nine	whutcha want?	2019	0	get ban get ban stick crack relax plan attack ...	77	0.001504	0.154302	0.168988	...	0.001504	0.001504	0.663165	0.726970	0.104417	0.000001	0.838211	0.767761	obscene	0.014286
28370	82450	will smith	switch	2019	0	check check yeah yeah hear thing call switch g...	67	0.001196	0.001196	0.001196	...	0.001196	0.001196	0.883028	0.786888	0.007027	0.000503	0.508450	0.885882	obscene	0.014286
28371	82451	jeezy	r.i.p.	2019	0	remix killer alive remix thriller trap bitch s...	83	0.001012	0.075202	0.001012	...	0.001012	0.033995	0.828875	0.674794	0.015862	0.000000	0.475474	0.492477	obscene	0.014286

11281 rows × 31 columns

The base rate on our classification is the proportion of the data set occupied by the largest label class:

df.groupby("genre").size() / len(df)

genre
0    0.080135
1    0.340839
2    0.221434
3    0.357592
dtype: float64

If we always guessed category 3, then we would expect an accuracy of roughly 36%. So, our task will be to see whether we can train a model to beat this.
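As a sanity check, the base rate can be reproduced from the genre counts printed earlier. This is a minimal sketch in plain Python; the `counts` dictionary is typed in by hand from the groupby output above, and the names `majority_genre` and `base_rate` are introduced here for illustration only.

```python
# Genre counts for the four genres we kept, copied from the groupby output above.
counts = {"hip hop": 904, "jazz": 3845, "reggae": 2498, "rock": 4034}

total = sum(counts.values())

# The majority class is the genre with the most songs.
majority_genre = max(counts, key=counts.get)

# Always guessing the majority class is correct this fraction of the time.
base_rate = counts[majority_genre] / total

print(majority_genre, round(base_rate, 3))
```

Any model we train should beat this number to be worth the effort.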

As we try to predict the genre of the track, we will use lyrics alongside some other engineered features (metadata) that we define below.

engineered_features = ['dating', 'violence', 'world/life', 'night/time','shake the audience','family/gospel', 'romantic', 'communication','obscene', 'music', 'movement/places', 'light/visual perceptions','family/spiritual', 'like/girls', 'sadness', 'feelings', 'danceability','loudness', 'acousticness', 'instrumentalness', 'valence', 'energy']

Our models only need these engineered features, the lyrics, and our target value (genre), so we can put them all into the same dataframe and use slicing to access the different parts later.

df_clean = df[engineered_features + ['lyrics', 'genre']].copy()
df_clean.head()

	dating	violence	world/life	night/time	shake the audience	family/gospel	romantic	communication	obscene	music	...	sadness	feelings	danceability	loudness	acousticness	instrumentalness	valence	energy	lyrics	genre
17091	0.001096	0.001096	0.001096	0.001096	0.036316	0.001096	0.001096	0.460773	0.086498	0.001096	...	0.319570	0.001096	0.352323	0.620388	0.868474	0.235830	0.430132	0.282260	lovers sweethearts hard understand know happen...	1
17092	0.001754	0.340964	0.001754	0.001754	0.001754	0.001754	0.131872	0.001754	0.001754	0.001754	...	0.001754	0.001754	0.379400	0.638541	0.907630	0.900810	0.221970	0.184159	head linger like haunt refrain spin round brai...	1
17093	0.001144	0.001144	0.074762	0.046173	0.001144	0.018789	0.001144	0.001655	0.001144	0.421734	...	0.001144	0.097082	0.489873	0.467400	0.992972	0.927126	0.334295	0.228204	music speak start hear musicians like dizzy gi...	1
17094	0.002105	0.180524	0.002105	0.002105	0.002105	0.002105	0.002105	0.201965	0.002105	0.002105	...	0.527429	0.002105	0.179032	0.559470	0.983936	0.001781	0.086974	0.235211	hand stranger paradise lose wonderland strange...	1
17095	0.001253	0.001253	0.001253	0.001253	0.001253	0.081126	0.001253	0.111951	0.001253	0.268737	...	0.425721	0.001253	0.580851	0.687409	0.655622	0.000000	0.936109	0.418400	zinga zinga zinga zinga zinga zinga zinga zing...	1

5 rows × 24 columns

Finally, we perform a train-validation split so that we can later evaluate our models on held-out data.

df_train, df_val = train_test_split(df_clean, shuffle=True, test_size=0.2)

Text Vectorization

We now need to vectorize the lyrics. We’re going to use tokenization to break up the lyrics into a sequence of tokens, and then vectorize that sequence.

We will be using a tokenizer imported from HuggingFace.

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

For our purposes it’s more convenient to assign an integer to each token, which we can do like this:

encoded = tokenizer("I love reggae music!")
encoded

{'input_ids': [101, 1045, 2293, 15662, 2189, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

To do the reverse, we can use the .decode method of the tokenizer:

tokenizer.decode(encoded["input_ids"])

'[CLS] i love reggae music! [SEP]'

Here is some code to help us prepare our dataset with encodings. Our lyrics vary in length, so we pad the shorter ones with 0s (the ID of BERT’s [PAD] token) and truncate the especially long ones. We make use of the torch Dataset class to help manage our data.

max_len = 512 # BERT capacity

def preprocess(df, tokenizer, max_len):
    lyrics_tokens = tokenizer(list(df["lyrics"]), padding="max_length", truncation=True, max_length=max_len)["input_ids"]
    engineered = df[engineered_features].values.tolist()
    y = list(df["genre"])
    return lyrics_tokens, engineered, y

class TextDataFromDF(Dataset):
    def __init__(self, df):
        self.lyrics_tokens, self.engineered_feats, self.y = preprocess(df, tokenizer, max_len)

    def __getitem__(self, ix):
        return self.lyrics_tokens[ix], self.engineered_feats[ix], self.y[ix]

    def __len__(self):
        return len(self.y)
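To make the padding and truncation step concrete, here is a simplified sketch of what happens to a single list of token IDs. `pad_or_truncate` is a hypothetical helper written for illustration, not part of the tokenizer's API, and it glosses over one detail: when the real tokenizer truncates, it keeps the closing [SEP] token, whereas this sketch simply cuts the list.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return a list of exactly max_len token IDs."""
    # Cut off anything beyond max_len...
    clipped = token_ids[:max_len]
    # ...and fill any remaining positions with the padding ID.
    return clipped + [pad_id] * (max_len - len(clipped))

short_ids = [101, 1045, 2293, 102]   # shorter than max_len: gets padded
long_ids = list(range(1, 11))        # longer than max_len: gets truncated

print(pad_or_truncate(short_ids, 6))
print(pad_or_truncate(long_ids, 6))
```

Either way, every song ends up with the same number of token IDs, which is what lets us stack them into a single tensor later.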

Let’s make our encoded datasets!

train_data = TextDataFromDF(df_train)
val_data   = TextDataFromDF(df_val)

Here is what a single song’s information looks like now:

X_tokens, X_feats, y = train_data[1]
print(X_tokens, X_feats)
print(y)

[101, 2372, 2113, 21209, 6887, 16585, 2477, 2111, 8501, 3613, 9266, 2213, 9680, 2444, 9152, 23033, 2015, 10675, 2015, 4401, 2991, 4533, 4952, 11898, 10432, 12170, 9102, 6510, 8081, 4485, 2729, 10667, 14033, 6510, 2131, 2477, 2175, 4485, 2131, 2518, 3861, 2272, 2420, 2208, 16371, 4246, 7047, 8046, 4485, 2215, 4485, 4248, 4355, 7281, 7579, 6841, 16360, 8091, 4485, 4982, 4503, 14255, 23344, 2227, 2131, 3947, 3238, 2444, 11565, 10020, 2102, 3305, 2514, 2665, 3259, 2192, 2518, 2903, 2066, 8554, 10421, 7200, 5223, 2342, 2757, 11274, 4372, 14540, 10696, 2111, 2219, 3828, 2111, 13660, 3240, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] [0.0011961722985867, 0.1300521675514939, 0.1951359532512352, 0.0011961722561089, 0.0011961723204168, 0.0011961722655141, 0.0011961722799619, 
0.1710064773999869, 0.3944457875450848, 0.0011961722772403, 0.0011961723013686, 0.0011961723181082, 0.0522687272336512, 0.0011961722965877, 0.0011961723815529, 0.0415406469256242, 0.4476334885735947, 0.7206368740866087, 0.0736938490902099, 0.0, 0.6589035449299256, 0.6446335461127515]
2

We are going to feed data in batches, so we need a DataLoader, along with a collate function to ensure that we are inputting tensors of the right size.

def collate(data):
    tokens = torch.tensor([d[0] for d in data], dtype=torch.long)
    engineered = torch.tensor([d[1] for d in data], dtype=torch.float)
    y = torch.tensor([d[2] for d in data], dtype=torch.long)
    return (tokens, engineered), y

train_loader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn = collate)
val_loader = DataLoader(val_data, batch_size=8, shuffle=True, collate_fn = collate)

Here is what a batch of data looks like. The predictor data is now a pair of tensors: one whose entries give token indices, padded with 0s, and another with the values of our engineered features. Note that X is a tuple of these two tensors, so X[:2] returns both tensors in full (all 8 songs in the batch) rather than the first 2 rows:

X, y = next(iter(train_loader))
X[:2]

(tensor([[  101,  2621,  4553,  ...,     0,     0,     0],
         [  101,  2668, 14740,  ...,     0,     0,     0],
         [  101,  2305,  2272,  ...,     0,     0,     0],
         ...,
         [  101,  2051,  2621,  ...,     0,     0,     0],
         [  101,  2051,  2202,  ...,     0,     0,     0],
         [  101,  5949,  2773,  ...,     0,     0,     0]]),
 tensor([[2.5063e-03, 2.5063e-03, 3.3226e-01, 9.8139e-02, 2.5063e-03, 2.5063e-03,
          2.5063e-03, 2.5063e-03, 2.5063e-03, 1.2809e-01, 2.5063e-03, 2.5063e-03,
          2.5063e-03, 2.8335e-01, 2.5063e-03, 2.5063e-03, 5.5594e-01, 7.7276e-01,
          2.0883e-02, 1.1235e-05, 2.8174e-01, 4.7846e-01],
         [1.8797e-03, 5.0473e-01, 1.8797e-03, 1.8797e-03, 3.7594e-02, 3.9971e-02,
          1.8797e-03, 1.8797e-03, 1.8797e-03, 1.8797e-03, 1.8797e-03, 1.8797e-03,
          1.3261e-01, 1.8797e-03, 2.2307e-01, 1.8797e-03, 8.1155e-01, 7.5999e-01,
          8.7248e-02, 2.5304e-03, 5.6513e-01, 6.5364e-01],
         [1.9493e-03, 1.9493e-03, 3.4622e-01, 1.9493e-03, 1.9493e-03, 1.9493e-03,
          1.9493e-03, 1.9493e-03, 1.9493e-03, 1.9493e-03, 2.8898e-01, 3.3361e-01,
          1.9493e-03, 1.9493e-03, 1.9493e-03, 1.9493e-03, 6.1118e-01, 5.5086e-01,
          9.5181e-01, 1.8725e-01, 3.4151e-01, 2.2320e-01],
         [5.7208e-04, 5.7208e-04, 5.1883e-02, 5.7208e-04, 6.8732e-02, 2.8145e-02,
          5.7208e-04, 2.5657e-01, 3.6477e-01, 5.7208e-04, 2.2247e-01, 5.7208e-04,
          5.7208e-04, 5.7208e-04, 5.7208e-04, 5.7208e-04, 7.4764e-01, 6.9715e-01,
          1.6867e-01, 0.0000e+00, 8.4233e-01, 4.8046e-01],
         [1.9611e-02, 3.9300e-01, 2.4161e-01, 8.9206e-04, 1.8294e-02, 7.0961e-02,
          8.9206e-04, 8.9206e-04, 8.9206e-04, 8.9206e-04, 8.9206e-04, 1.4598e-01,
          4.5708e-02, 5.5025e-02, 8.9206e-04, 8.9206e-04, 8.0180e-01, 6.7451e-01,
          5.6827e-05, 9.4838e-01, 9.0210e-01, 7.9379e-01],
         [1.0526e-03, 1.0526e-03, 1.0526e-03, 3.2379e-01, 1.0526e-03, 1.0526e-03,
          9.0729e-02, 1.7966e-01, 1.0526e-03, 1.0713e-01, 1.0526e-03, 2.5740e-01,
          1.0526e-03, 1.0526e-03, 1.0526e-03, 2.7609e-02, 2.0286e-01, 6.0398e-01,
          9.4478e-01, 5.6883e-05, 5.1731e-02, 2.8426e-01],
         [1.5038e-03, 1.8747e-01, 1.5038e-03, 1.6175e-01, 1.5038e-03, 1.5038e-03,
          1.5038e-03, 1.5038e-03, 4.1537e-02, 1.5038e-03, 1.5038e-03, 1.5038e-03,
          1.5038e-03, 1.5038e-03, 5.2697e-01, 1.5038e-03, 5.4403e-01, 7.8607e-01,
          6.8374e-06, 8.3806e-01, 3.3739e-01, 9.4795e-01],
         [2.9240e-03, 2.9240e-03, 2.9240e-03, 1.0013e-01, 2.9240e-03, 2.9240e-03,
          2.9240e-03, 2.9240e-03, 2.9240e-03, 1.7470e-01, 2.9240e-03, 2.9240e-03,
          2.9240e-03, 9.0332e-02, 4.7758e-01, 2.9240e-03, 6.3609e-01, 7.2769e-01,
          5.7932e-01, 3.3603e-01, 6.4963e-01, 7.6075e-01]]))

y[:2]

tensor([3, 2])
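A small aside on the slicing above: because X is a tuple holding two tensors, X[:2] slices the tuple, not the rows. The sketch below uses nested lists as stand-ins for the tensors (an assumption made so the example runs without PyTorch) to show the difference.

```python
# Stand-ins for a batch of 3 songs: token IDs and engineered features.
tokens = [[101, 7, 0], [101, 8, 0], [101, 9, 0]]
feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
X = (tokens, feats)

# Slicing the tuple keeps both elements, with every row intact.
whole_batch = X[:2]

# To see only the first two songs, slice each element instead.
first_two = tuple(part[:2] for part in X)

print(len(whole_batch[0]), len(first_two[0]))
```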

Model Building

We are going to train three neural networks to classify our genres.

Using Lyrics to Classify
Using Engineered Features (Metadata) to Classify
Using Lyrics and Metadata to Classify

Let’s build a model for classifying genres based on lyrics first.

Lyrical Classification

class TextClassificationModel(nn.Module):

    def __init__(self,vocab_size, embedding_dim, max_len, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size+1, embedding_dim)
        self.dropout = nn.Dropout(0.2)
        self.fc_flat = nn.Linear(embedding_dim, embedding_dim)
        self.fc = nn.Linear(embedding_dim, num_class) # max_len*embedding_dim
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.embedding(x)
        x = self.fc_flat(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = x.mean(axis = 1)
        # x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

Our model begins with the embedding layer, where each token is looked up in an embedding table and turned into a learned vector of size embedding_dim. Immediately after embedding, we pass each token’s vector through a small fully-connected layer and then a ReLU activation; this lets the model learn a richer representation before pooling. We then apply dropout, which randomly zeroes 20% of the entries of these vectors during training. This is a form of regularization meant to keep us from becoming over-reliant on particular features. Our mean-pool step then averages the token vectors across the sequence (padding positions included), so each song becomes a fixed-size vector. Finally, our linear layer gives us one score (logit) per genre; the cross-entropy loss we use later turns these scores into probabilities internally.
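One thing worth noting about the pooling step: x.mean(axis = 1) averages over all 512 positions, including the padding. A possible variation, sketched below, averages only over real tokens. This is not what the model in this post does; `masked_mean` is a hypothetical helper, and it assumes padding uses token ID 0, as in our encoded data.

```python
import torch

def masked_mean(x, token_ids, pad_id=0):
    """Average token vectors over non-padding positions only.

    x:         (batch, seq_len, dim) per-token vectors
    token_ids: (batch, seq_len) integer token IDs
    """
    # 1.0 where there is a real token, 0.0 where there is padding
    mask = (token_ids != pad_id).unsqueeze(-1).to(x.dtype)
    # sum the vectors of real tokens only
    summed = (x * mask).sum(dim=1)
    # number of real tokens per song (at least 1 to avoid dividing by zero)
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts

# One toy song: two real tokens and one padding position.
x = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
token_ids = torch.tensor([[101, 2293, 0]])

plain = x.mean(dim=1)               # padding vector pulls the average up
masked = masked_mean(x, token_ids)  # padding vector is ignored
print(plain, masked)
```

Whether this helps on our data is an open question; short songs are the ones most affected, since most of their 512 positions are padding.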

Let’s have a look at the model!

vocab_size = len(tokenizer.vocab)
embedding_dim = 32
num_class = len(genres)

text_model = TextClassificationModel(vocab_size, embedding_dim, max_len, num_class).to(device)

# no input size is passed, so torchinfo lists parameter counts only (no forward pass is run)
summary(text_model)

=================================================================
Layer (type:depth-idx)                   Param #
=================================================================
TextClassificationModel                  --
├─Embedding: 1-1                         976,736
├─Dropout: 1-2                           --
├─Linear: 1-3                            1,056
├─Linear: 1-4                            132
├─ReLU: 1-5                              --
=================================================================
Total params: 977,924
Trainable params: 977,924
Non-trainable params: 0
=================================================================

We have a large number of trainable parameters, nearly all of them in the embedding table! We could make this architecture more lightweight by reducing the size of our embedding dimension.
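The counts in the summary can be reproduced by hand. The sketch below is plain arithmetic, with two hypothetical helper functions; the vocabulary size of 30522 is the value of len(tokenizer.vocab) implied by the 976,736 figure in the summary.

```python
def embedding_params(num_embeddings, dim):
    # one learned vector of length dim per row of the table
    return num_embeddings * dim

def linear_params(in_features, out_features):
    # weight matrix plus one bias per output unit
    return in_features * out_features + out_features

vocab_size = 30522      # implied by the summary above
embedding_dim = 32
num_class = 4

emb = embedding_params(vocab_size + 1, embedding_dim)
fc_flat = linear_params(embedding_dim, embedding_dim)
fc = linear_params(embedding_dim, num_class)
total = emb + fc_flat + fc

print(emb, fc_flat, fc, total)
```

Since the embedding table scales linearly with embedding_dim, halving the dimension roughly halves the size of the model.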

Below, we define a training loop that can be used for all three of our models. We also define an accuracy function to evaluate the overall accuracy of a model, and another to evaluate its per-class accuracy.

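def train(model, dataloader, mode="lyrics", vocab_freq=False):
    # note: the optimizer is re-created on every call, so Adam's running statistics restart each epoch
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    # make sure dropout is active while training
    model.train()

    epoch_start_time = time.time()
    # keep track of some counts for measuring accuracy
    total_acc, total_count = 0, 0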
    
    for X, y in dataloader:
        # unpack and move to device
        tokens, engineered = X
        y = y.to(device)

        if mode == "lyrics":
            """
            if vocab_freq:
                vocab = build_vocab_from_iterator(tokens, specials=["<unk>"], min_freq = 50)
                tokens = torch.tensor(vocab)
            """
            data = tokens.to(device)
        elif mode == "engineered":
            data = engineered.to(device)
        else:
            data = X

        # zero gradients
        optimizer.zero_grad()
        # form prediction on batch
        predicted_label = model(data)
        # evaluate loss on prediction
        loss = loss_fn(predicted_label, y)
        # compute gradient
        loss.backward()
        # take an optimization step
        optimizer.step()
                
        # for printing accuracy
        total_acc += (predicted_label.argmax(1) == y).sum().item()
        total_count += y.size(0)

    # note: `epoch` is the global loop variable set by the calling for-loop
    print(f'| epoch {epoch:3d} | train accuracy {total_acc/total_count:8.3f} | time: {time.time() - epoch_start_time:5.2f}s')

def accuracy(model, dataloader, mode="lyrics"):
    total_acc, total_count = 0, 0

    # evaluate with dropout disabled, then restore the previous mode
    was_training = model.training
    model.eval()

    with torch.no_grad():
        for X, y in dataloader:
            # unpack and move to device
            tokens, engineered = X
            y = y.to(device)

            if mode == "lyrics":
                data = tokens.to(device)
            elif mode == "engineered":
                data = engineered.to(device)
            elif mode == "both":
                data = X

            predicted_label = model(data)
            total_acc += (predicted_label.argmax(1) == y).sum().item()
            total_count += y.size(0)

    model.train(was_training)
    return total_acc/total_count

def per_class_accuracy(model, dataloader, mode="lyrics", num_classes=4):
    model.eval()
    correct = [0] * num_classes
    total   = [0] * num_classes

    with torch.no_grad():
        for X, y in dataloader:
            tokens, engineered = X
            y = y.to(device)

            if mode == "lyrics":
                data = tokens.to(device)
            elif mode == "engineered":
                data = engineered.to(device)
            else:
                data = X

            outputs = model(data)
            preds = outputs.argmax(dim=1)

            for cls in range(len(correct)):
                mask = (y == cls)
                total[cls] += mask.sum().item()
                correct[cls] += ((preds == cls) & mask).sum().item()

    return {
        cls: (correct[cls] / total[cls] if total[cls] > 0 else 0.0)
        for cls in range(len(correct))
    }

Now that we have those functions, let’s jump right in and see how our model does when training on lyrics!

EPOCHS = 25
for epoch in range(1, EPOCHS + 1):
    train(text_model, train_loader, "lyrics")
    print("     test accuracy  ", accuracy(text_model, val_loader))

| epoch   1 | train accuracy    0.379 | time:  4.47s
     test accuracy   0.3540097474523704
| epoch   2 | train accuracy    0.398 | time:  3.97s
     test accuracy   0.39787328311918474
| epoch   3 | train accuracy    0.425 | time:  3.94s
     test accuracy   0.4262295081967213
| epoch   4 | train accuracy    0.457 | time:  4.03s
     test accuracy   0.42977403633141337
| epoch   5 | train accuracy    0.498 | time:  4.06s
     test accuracy   0.46256092157731504
| epoch   6 | train accuracy    0.556 | time:  3.77s
     test accuracy   0.4980062029242357
| epoch   7 | train accuracy    0.600 | time:  3.75s
     test accuracy   0.538325210456358
| epoch   8 | train accuracy    0.642 | time:  3.64s
     test accuracy   0.5578201151971643
| epoch   9 | train accuracy    0.674 | time:  3.64s
     test accuracy   0.5604785112981834
| epoch  10 | train accuracy    0.695 | time:  3.67s
     test accuracy   0.5746566238369517
| epoch  11 | train accuracy    0.712 | time:  3.68s
     test accuracy   0.5813026140894993
| epoch  12 | train accuracy    0.729 | time:  3.74s
     test accuracy   0.5795303500221533
| epoch  13 | train accuracy    0.745 | time:  4.02s
     test accuracy   0.5724412937527692
| epoch  14 | train accuracy    0.757 | time:  3.94s
     test accuracy   0.5777580859548073
| epoch  15 | train accuracy    0.767 | time:  4.04s
     test accuracy   0.5764288879042977
| epoch  16 | train accuracy    0.782 | time:  3.85s
     test accuracy   0.5746566238369517
| epoch  17 | train accuracy    0.795 | time:  3.86s
     test accuracy   0.5755427558706248
| epoch  18 | train accuracy    0.799 | time:  3.91s
     test accuracy   0.5684536996012406
| epoch  19 | train accuracy    0.813 | time:  4.09s
     test accuracy   0.5755427558706248
| epoch  20 | train accuracy    0.821 | time:  3.99s
     test accuracy   0.5737704918032787
| epoch  21 | train accuracy    0.831 | time:  4.46s
     test accuracy   0.5693398316349136
| epoch  22 | train accuracy    0.840 | time:  4.20s
     test accuracy   0.5742135578201152
| epoch  23 | train accuracy    0.849 | time:  4.23s
     test accuracy   0.5622507753655295
| epoch  24 | train accuracy    0.857 | time:  4.39s
     test accuracy   0.561807709348693
| epoch  25 | train accuracy    0.859 | time:  4.26s
     test accuracy   0.562693841382366

accuracy(text_model, val_loader)

0.5666814355338945

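An accuracy around 56% may not seem all that great at first glance… however, let’s remember our base rate was 36%, so even though the accuracy isn’t particularly high, the model clearly beats the baseline! Note also that validation accuracy peaks at about 58% around epoch 11 and then drifts slightly downward while training accuracy keeps climbing to 86%, a sign that the model is overfitting.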
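The log above shows validation accuracy levelling off well before epoch 25 while training accuracy keeps rising. One simple way to act on that, sketched here as an assumption rather than something done in this post, is to record the validation accuracy of each epoch and keep the model from the best one. The list below is copied from the log, rounded to four decimals.

```python
# Validation accuracy per epoch, rounded from the training log above.
val_acc = [
    0.3540, 0.3979, 0.4262, 0.4298, 0.4626,
    0.4980, 0.5383, 0.5578, 0.5605, 0.5747,
    0.5813, 0.5795, 0.5724, 0.5778, 0.5764,
    0.5747, 0.5755, 0.5685, 0.5755, 0.5738,
    0.5693, 0.5742, 0.5623, 0.5618, 0.5627,
]

# Epochs are numbered from 1, list positions from 0.
best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1
best_acc = val_acc[best_epoch - 1]

print(best_epoch, best_acc)
```

In a real training loop one would typically save the model's weights whenever the validation accuracy improves and reload the best set at the end.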

Let’s look at our accuracy on each of our genres. A quick reminder that our genre keys are:

hip hop: 0
jazz: 1
reggae: 2
rock: 3

per_class_accuracy(text_model, val_loader, mode="lyrics")

{0: 0.47701149425287354,
 1: 0.5816326530612245,
 2: 0.5031055900621118,
 3: 0.6090686274509803}

Even our weakest genre (hip hop at around 48%) comfortably exceeds the 36% base rate! Our model is indeed learning useful signals from the lyrics. Our best performances were on jazz and rock, which may suggest that those lyrics have more distinct stylistic patterns, although they are also the two largest classes in our data. Hip hop and reggae, on the other hand, may have suffered because of slang or patois lyrics, or possibly thematic overlap.
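Per-class accuracy tells us how often each genre is recognized, but not which genres it gets mistaken for. A confusion matrix would show that. The sketch below is plain Python with made-up labels, purely to illustrate the idea; `confusion_matrix` is a hypothetical helper written here, and none of the numbers come from our models.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    # rows: true genre, columns: predicted genre
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels for illustration (0: hip hop, 1: jazz, 2: reggae, 3: rock).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 2, 1, 3, 2, 0, 3, 3]

cm = confusion_matrix(y_true, y_pred, num_classes=4)
for row in cm:
    print(row)

# Per-class accuracy is the diagonal entry divided by the row total.
per_class = [cm[c][c] / sum(cm[c]) for c in range(4)]
print(per_class)
```

In this toy example hip hop and reggae are confused with each other, which is the kind of pattern that would support the thematic-overlap guess above.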

Engineered Features Classification

Let’s tackle using our engineered features to try and determine song genres!

class MetadataClassificationModel(nn.Module):

    def __init__(self, num_features, num_class):
        super().__init__()
    
        self.pipeline = nn.Sequential(
            nn.Linear(num_features, 18), 
            nn.ReLU(),
            nn.Linear(18, 12), 
            nn.ReLU(),
            nn.Linear(12, 8), 
            nn.ReLU(),
            nn.Linear(8, num_class)
            )

    def forward(self, x):
        return self.pipeline(x)

    def predict(self, x):
        # predicted genre is the class with the largest score (logit)
        return self.forward(x).argmax(dim=1)

This is a pretty simple architecture for our twenty-two engineered features. We use a series of fully-connected linear layers with a ReLU activation between each pair of layers; the final layer outputs one score (logit) per genre.

num_features = len(engineered_features)

meta_model = MetadataClassificationModel(num_features, num_class).to(device)
# no input size is passed, so torchinfo lists parameter counts only (no forward pass is run)
summary(meta_model)

=================================================================
Layer (type:depth-idx)                   Param #
=================================================================
MetadataClassificationModel              --
├─Sequential: 1-1                        --
│    └─Linear: 2-1                       414
│    └─ReLU: 2-2                         --
│    └─Linear: 2-3                       228
│    └─ReLU: 2-4                         --
│    └─Linear: 2-5                       104
│    └─ReLU: 2-6                         --
│    └─Linear: 2-7                       36
=================================================================
Total params: 782
Trainable params: 782
Non-trainable params: 0
=================================================================

This model is pretty lightweight compared to the lyric-based model. Let’s see how it performs!

EPOCHS = 25
for epoch in range(1, EPOCHS + 1):
    train(meta_model, train_loader, "engineered")
    print("     test accuracy  ", accuracy(meta_model, val_loader, "engineered"))

| epoch   1 | train accuracy    0.459 | time:  4.19s
     test accuracy   0.5002215330084182
| epoch   2 | train accuracy    0.589 | time:  4.00s
     test accuracy   0.615861763402747
| epoch   3 | train accuracy    0.636 | time:  4.65s
     test accuracy   0.6278245458573327
| epoch   4 | train accuracy    0.643 | time:  3.95s
     test accuracy   0.6371289322108994
| epoch   5 | train accuracy    0.645 | time:  3.90s
     test accuracy   0.6371289322108994
| epoch   6 | train accuracy    0.650 | time:  3.75s
     test accuracy   0.640230394328755
| epoch   7 | train accuracy    0.649 | time:  3.73s
     test accuracy   0.6357997341603899
| epoch   8 | train accuracy    0.652 | time:  3.78s
     test accuracy   0.642002658396101
| epoch   9 | train accuracy    0.649 | time:  3.51s
     test accuracy   0.6468763845813026
| epoch  10 | train accuracy    0.651 | time:  3.73s
     test accuracy   0.640230394328755
| epoch  11 | train accuracy    0.649 | time:  3.52s
     test accuracy   0.6513070447496677
| epoch  12 | train accuracy    0.654 | time:  3.54s
     test accuracy   0.641116526362428
| epoch  13 | train accuracy    0.654 | time:  3.74s
     test accuracy   0.6504209127159947
| epoch  14 | train accuracy    0.655 | time:  3.39s
     test accuracy   0.6526362428001772
| epoch  15 | train accuracy    0.656 | time:  3.48s
     test accuracy   0.6424457244129376
| epoch  16 | train accuracy    0.654 | time:  3.31s
     test accuracy   0.6464333185644661
| epoch  17 | train accuracy    0.657 | time:  3.27s
     test accuracy   0.6451041205139566
| epoch  18 | train accuracy    0.654 | time:  3.38s
     test accuracy   0.6442179884802836
| epoch  19 | train accuracy    0.658 | time:  3.29s
     test accuracy   0.642002658396101
| epoch  20 | train accuracy    0.660 | time:  3.32s
     test accuracy   0.6446610544971201
| epoch  21 | train accuracy    0.658 | time:  3.24s
     test accuracy   0.6477625166149756
| epoch  22 | train accuracy    0.657 | time:  3.14s
     test accuracy   0.6482055826318122
| epoch  23 | train accuracy    0.658 | time:  3.21s
     test accuracy   0.6477625166149756
| epoch  24 | train accuracy    0.656 | time:  3.17s
     test accuracy   0.6455471865307931
| epoch  25 | train accuracy    0.655 | time:  3.09s
     test accuracy   0.6530793088170137

accuracy(meta_model, val_loader, "engineered")

0.6530793088170137

Whoa! Using only metadata, we achieved around 65% accuracy! This is much better than our base rate, and higher than the lyrics-only classification approach.

per_class_accuracy(meta_model, val_loader, mode="engineered")

{0: 0.5689655172413793,
 1: 0.6224489795918368,
 2: 0.6128364389233955,
 3: 0.7242647058823529}

We are also outperforming all of our base rates for each genre! Once again, rock is our highest performer (around 72%), showing its distinction from other genres in categories like instrumentalness, energy, movement/places, etc.

Combined Feature Classification

We have now explored successful approaches using lyrics and using metadata. Let’s see how we perform when we combine the two!

class CombinedNet(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, num_class, num_features):
        super().__init__()
    
        # engineered features pipeline
        self.eng_pipeline = nn.Sequential(
            nn.Linear(num_features, 18), 
            nn.ReLU(),
            nn.Linear(18, 12), 
            nn.ReLU(),
            nn.Linear(12, 8)
            )
        
        # text pipeline 
        self.embedding = nn.Embedding(vocab_size+1, embedding_dim)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(embedding_dim, 8)

        # combine the two pipelines
        self.combine = nn.Sequential(
            nn.Linear(16, 12), 
            nn.ReLU(),
            nn.Linear(12, 8), 
            nn.ReLU(),
            nn.Linear(8, num_class)
        )
    
    def forward(self, x):
        x_text, x_eng = x
        x_text = x_text.to(device)  
        x_eng = x_eng.to(device)
        
        # text pipeline:
        x_text = self.embedding(x_text)
        x_text = self.relu(x_text)
        x_text = x_text.mean(axis = 1)
        x_text = self.fc(x_text)

        # engineered features pipeline:
        x_eng = self.eng_pipeline(x_eng)

        # then, combine them with: 
        x_comb = torch.cat([x_text, x_eng], dim = 1).to(device)
        
        # pass x_comb through a couple more fully-connected layers and return output
        return self.combine(x_comb)

The main ideas from the other pipelines remain. We first pass the lyrics and the engineered features through separate pipelines that follow similar procedures to those above, then we concatenate the two 8-dimensional outputs and pass them through several more fully-connected layers. Notable changes come in our text pipeline, where we removed a fully-connected layer and our dropout. These changes were the result of trial-and-error testing. Additionally, we bring our separate pipelines together before they are compressed into our four-class classification.
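To make the shapes concrete, here is a minimal sketch of the same two-branch flow outside the class. The batch size, sequence length, vocabulary size, embedding dimension, and feature count are hypothetical values chosen for illustration; only the 8 + 8 = 16 split mirrors `CombinedNet`.

```python
import torch
from torch import nn

# Hypothetical sizes for illustration only.
batch_size, max_len, vocab_size, embedding_dim, num_features = 4, 10, 100, 32, 22

x_text = torch.randint(0, vocab_size + 1, (batch_size, max_len))  # token ids
x_eng = torch.rand(batch_size, num_features)                      # engineered features

embedding = nn.Embedding(vocab_size + 1, embedding_dim)
fc = nn.Linear(embedding_dim, 8)
eng_pipeline = nn.Sequential(
    nn.Linear(num_features, 18), nn.ReLU(),
    nn.Linear(18, 12), nn.ReLU(),
    nn.Linear(12, 8),
)

h_text = embedding(x_text)                # (batch, max_len, embedding_dim)
h_text = torch.relu(h_text).mean(dim=1)   # mean-pool over tokens -> (batch, embedding_dim)
h_text = fc(h_text)                       # (batch, 8)
h_eng = eng_pipeline(x_eng)               # (batch, 8)

# Concatenate along the feature dimension: 8 + 8 = 16 inputs for the combine head.
x_comb = torch.cat([h_text, h_eng], dim=1)
print(h_text.shape, h_eng.shape, x_comb.shape)
```

Each branch is squeezed to the same width before concatenation, so neither the lyrics nor the metadata dominates the input to the final layers by sheer dimensionality.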

combined_model = CombinedNet(vocab_size, embedding_dim, num_class, num_features).to(device)
# no input size is passed, so torchinfo only reports parameter counts
summary(combined_model)

=================================================================
Layer (type:depth-idx)                   Param #
=================================================================
CombinedNet                              --
├─Sequential: 1-1                        --
│    └─Linear: 2-1                       414
│    └─ReLU: 2-2                         --
│    └─Linear: 2-3                       228
│    └─ReLU: 2-4                         --
│    └─Linear: 2-5                       104
├─Embedding: 1-2                         976,736
├─ReLU: 1-3                              --
├─Linear: 1-4                            264
├─Sequential: 1-5                        --
│    └─Linear: 2-6                       204
│    └─ReLU: 2-7                         --
│    └─Linear: 2-8                       104
│    └─ReLU: 2-9                         --
│    └─Linear: 2-10                      36
=================================================================
Total params: 978,090
Trainable params: 978,090
Non-trainable params: 0
=================================================================
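The counts in this summary can be checked by hand. The sketch below assumes values that are inferred from the table rather than stated here: 22 engineered features, an embedding dimension of 32, a vocabulary of 30,522 tokens (the size of the `bert-base-uncased` vocabulary), and four classes. The helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


def embedding_params(num_embeddings, dim):
    """One dim-sized vector per vocabulary entry."""
    return num_embeddings * dim


# Assumed values, inferred from the summary table.
num_features = 22    # 22 * 18 + 18 = 414
embedding_dim = 32   # 32 * 8 + 8 = 264
vocab_size = 30522   # (30522 + 1) * 32 = 976,736
num_class = 4        # 8 * 4 + 4 = 36

eng = linear_params(num_features, 18) + linear_params(18, 12) + linear_params(12, 8)
emb = embedding_params(vocab_size + 1, embedding_dim)
text = emb + linear_params(embedding_dim, 8)
head = linear_params(16, 12) + linear_params(12, 8) + linear_params(8, num_class)
total = eng + text + head

print(f"total parameters: {total:,}")
print(f"share held by the embedding: {emb / total:.1%}")
```

Under these assumptions the total matches the 978,090 reported by `summary`, and the embedding table accounts for nearly all of it.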

Evidently, our model once again has a huge number of trainable parameters, almost all of them in the embedding layer. Let’s see how it does!

EPOCHS = 25
for epoch in range(1, EPOCHS + 1):
    train(combined_model, train_loader, "both")
    print("     test accuracy  ", accuracy(combined_model, val_loader, "both"))

| epoch   1 | train accuracy    0.495 | time:  5.70s
     test accuracy   0.5423128046078866
| epoch   2 | train accuracy    0.564 | time:  5.37s
     test accuracy   0.5560478511298184
| epoch   3 | train accuracy    0.571 | time:  5.38s
     test accuracy   0.5724412937527692
| epoch   4 | train accuracy    0.584 | time:  5.38s
     test accuracy   0.5799734160389898
| epoch   5 | train accuracy    0.594 | time:  5.33s
     test accuracy   0.5516171909614532
| epoch   6 | train accuracy    0.611 | time:  5.45s
     test accuracy   0.5990252547629596
| epoch   7 | train accuracy    0.631 | time:  5.34s
     test accuracy   0.6322552060256978
| epoch   8 | train accuracy    0.644 | time:  5.38s
     test accuracy   0.615861763402747
| epoch   9 | train accuracy    0.654 | time:  5.48s
     test accuracy   0.6473194505981391
| epoch  10 | train accuracy    0.669 | time:  5.75s
     test accuracy   0.6464333185644661
| epoch  11 | train accuracy    0.677 | time:  6.30s
     test accuracy   0.6566238369517058
| epoch  12 | train accuracy    0.692 | time:  5.92s
     test accuracy   0.6544085068675233
| epoch  13 | train accuracy    0.699 | time:  6.66s
     test accuracy   0.6575099689853788
| epoch  14 | train accuracy    0.711 | time:  6.78s
     test accuracy   0.6575099689853788
| epoch  15 | train accuracy    0.720 | time:  7.07s
     test accuracy   0.6326982720425344
| epoch  16 | train accuracy    0.729 | time:  7.26s
     test accuracy   0.66371289322109
| epoch  17 | train accuracy    0.741 | time:  7.66s
     test accuracy   0.6548515728843598
| epoch  18 | train accuracy    0.748 | time: 10.54s
     test accuracy   0.6526362428001772
| epoch  19 | train accuracy    0.759 | time:  9.17s
     test accuracy   0.6570669029685423
| epoch  20 | train accuracy    0.765 | time:  8.58s
     test accuracy   0.6641559592379265
| epoch  21 | train accuracy    0.776 | time:  6.94s
     test accuracy   0.6495347806823216
| epoch  22 | train accuracy    0.784 | time:  6.50s
     test accuracy   0.6486486486486487
| epoch  23 | train accuracy    0.789 | time:  6.38s
     test accuracy   0.66371289322109
| epoch  24 | train accuracy    0.800 | time:  6.19s
     test accuracy   0.6544085068675233
| epoch  25 | train accuracy    0.811 | time:  6.24s
     test accuracy   0.6260522817899867

accuracy(combined_model, val_loader, "both")

0.6260522817899867

After twenty-five epochs, we achieved an accuracy of around 62%, which is slightly disappointing. If we look closely at the evolution of our testing accuracy, we mostly hovered around 65% from epoch 9 onward, peaking at about 66% at epoch 20. The final drop may just be noise in the training process, but it may also reflect the beginning of overfitting: training accuracy kept climbing to 81% while testing accuracy stayed flat.
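One common response to this kind of late drop is early stopping: keep the weights from the best epoch instead of the last one. The sketch below replays the test accuracies logged above (rounded to four decimals) through a simple patience rule. The function `replay_early_stopping` and the patience of 3 are hypothetical choices, not part of the original training loop.

```python
def replay_early_stopping(val_accuracies, patience):
    """Return (best_epoch, best_accuracy, stopped_epoch) for a patience rule.

    Training stops once `patience` epochs pass without a new best accuracy.
    """
    best_acc, best_epoch = float("-inf"), 0
    stopped_epoch = len(val_accuracies)
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            stopped_epoch = epoch
            break
    return best_epoch, best_acc, stopped_epoch


# Test accuracies of the combined model, epochs 1 to 25, from the log above.
logged_val_acc = [
    0.5423, 0.5560, 0.5724, 0.5800, 0.5516, 0.5990, 0.6323, 0.6159, 0.6473,
    0.6464, 0.6566, 0.6544, 0.6575, 0.6575, 0.6327, 0.6637, 0.6549, 0.6526,
    0.6571, 0.6642, 0.6495, 0.6486, 0.6637, 0.6544, 0.6261,
]

best_epoch, best_acc, stopped_epoch = replay_early_stopping(logged_val_acc, patience=3)
print(f"stop at epoch {stopped_epoch}, keep epoch {best_epoch} ({best_acc:.1%})")
```

With a patience of 3, this rule would have stopped at epoch 19 and kept the epoch-16 weights (about 66.4%) instead of the final 62.6%. In a live loop, one would also save a copy of `combined_model.state_dict()` whenever the best accuracy improves. Since the same split is used both to pick the epoch and to report accuracy, the kept number is likely a little optimistic.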

per_class_accuracy(combined_model, val_loader, mode="both")

{0: 0.6839080459770115,
 1: 0.4872448979591837,
 2: 0.6977225672877847,
 3: 0.7046568627450981}

Curse you, jazz! We are doing significantly better on every genre other than jazz, where accuracy sits just under 49%. This may be because jazz lyrics are slightly less theme-driven, combined with the atypical structure of jazz music. Maybe swing rhythms, tempo changes, and odd time signatures don’t fit neatly into any given category along with the lyrics.
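Per-class accuracy tells us that jazz is hard, but not which genres it gets mistaken for. A confusion matrix would show that. The sketch below is a plain-Python version run on made-up toy labels, not on the real model outputs; the label order (0 = hip hop, 1 = jazz, 2 = reggae, 3 = rock) is an assumption, and `confusion_matrix` is a hypothetical helper.

```python
def confusion_matrix(y_true, y_pred, num_class):
    """matrix[t][p] counts examples with true label t predicted as p."""
    matrix = [[0] * num_class for _ in range(num_class)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


genres = ["hip hop", "jazz", "reggae", "rock"]  # assumed label order

# Toy labels for illustration only; real ones would come from val_loader.
y_true = [0, 0, 1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 3, 2, 1, 2, 2, 3, 1]

cm = confusion_matrix(y_true, y_pred, num_class=len(genres))
for label, (name, row) in enumerate(zip(genres, cm)):
    # The diagonal entry over the row sum is the per-class accuracy.
    print(f"{name:>8}: {row}  accuracy = {row[label] / sum(row):.2f}")
```

On the real data, the jazz row would show whether the misclassified jazz tracks land mostly in one neighboring genre or are spread across all three.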

Closing Remarks

Through our explorations, our metadata-only model yielded the highest accuracy at around 65%, our combined network was not far behind at around 62%, and trailing behind was a purely lyric-based approach that achieved 56% accuracy. Despite their varying and somewhat low accuracies, all the models outperformed the base rate of 36%. We narrowed down our search space to only four genres: hip hop, jazz, reggae, and rock. Of these genres, we had the easiest time distinguishing rock and reggae, with jazz proving especially hard to nail down.

This blog post was an exercise in crafting deep learning pipelines by applying themes we learned in readings and class (e.g., mean-pooling, non-linear activation functions, hidden layers, etc.) alongside simple trial and error. One large takeaway I had was that feature concatenation does not guarantee improved model accuracy, and in some cases can add more noise than clarity to the model.

Some possible continuations for this project could be to modify model depth and complexity, implement vocabulary thresholds, expand the number of genres we look at, etc.